Preparations

Load the necessary libraries

library(mgcv)      #for GAMs
library(gratia)    #for GAM plots
library(emmeans)   #for marginal means etc
library(broom)     #for tidy output
library(MuMIn)     #for model selection and AICc
library(lubridate) #for processing dates
library(tidyverse) #for data wrangling
library(DHARMa)    #for residuals diagnostics
library(performance) #for residual disagnostics
library(see)        # to visualize residual diagnostics
library(patchwork)  #for grids of plots
theme_set(theme_classic())

Scenario

The Australian Institute of Marine Science (AIMS) have a long-term inshore marine water quality monitoring program in which water samples are collected and analysed from sites (reef_alias) across the GBR numerous times per year. The focus of this program is to report long-term condition and change in water quality parameters.

Although we do have latitude and longitudes, the nature of the spatial design predominantly reflects a series of transects that start near the mouth of a major river and extend northwards, yet mainly within the open coastal zone. As a result, this design is not well suited to any specific spatial analyses (since they are mainly one dimensional).

AIMS water quality monitoring

Format of aims.wq.csv data file

latitude longitude reef_alias water_samples region subregion season water_year nox
-16.1 145. Cape Trib… AIMS Wet T… Barron D… Dry 2008 0.830
-16.1 145. Cape Trib… AIMS Wet T… Barron D… Wet 2008 0.100
-16.1 145. Cape Trib… AIMS Wet T… Barron D… Dry 2009 0.282
-16.1 145. Cape Trib… AIMS Wet T… Barron D… Wet 2009 1.27
-16.1 145. Cape Trib… AIMS Wet T… Barron D… Dry 2009 0.793
-16.1 145. Cape Trib… AIMS Wet T… Barron D… Dry 2010 0.380
... ... ... ... ... ... ... ... ...
latitude - Latitudinal coordinate
longitude - Longitudinal coordinate
reef_alias - Internal AIMS reef name
water_samples - Categorical label of who collected the data
region - The MMP region
subregion - The MMP subregion
season - A categorical listing of Wet or Dry
water_year - The water year (1st Oct - 30 Sept) to which the data are attached
date - The date the sample was collected
nox - Nitrite + Nitrate added together

Read in the data

wq <- read_csv('../data/aims.wq.csv', trim_ws=TRUE) %>%
  janitor::clean_names() %>%
  select(latitude, longitude, reef_alias, water_samples, region, subregion, 
         season, water_year, month = mnth, date, nox = n_ox)
wq <- wq %>% 
  mutate(reef_alias = factor(reef_alias),
         region = factor(region),
         subregion = factor(subregion),
         season = factor(season))
glimpse(wq)
## Rows: 618
## Columns: 11
## $ latitude      <dbl> -16.11152, -16.11383, -16.11261, -16.11308, -16.11343, …
## $ longitude     <dbl> 145.4833, 145.4827, 145.4844, 145.4868, 145.4835, 145.4…
## $ reef_alias    <fct> Cape Tribulation, Cape Tribulation, Cape Tribulation, C…
## $ water_samples <chr> "AIMS", "AIMS", "AIMS", "AIMS", "AIMS", "AIMS", "AIMS",…
## $ region        <fct> Wet Tropics, Wet Tropics, Wet Tropics, Wet Tropics, Wet…
## $ subregion     <fct> Barron Daintree, Barron Daintree, Barron Daintree, Barr…
## $ season        <fct> Dry, Wet, Dry, Wet, Dry, Dry, Wet, Dry, Dry, Wet, Dry, …
## $ water_year    <dbl> 2008, 2008, 2009, 2009, 2009, 2010, 2010, 2010, 2011, 2…
## $ month         <dbl> 10, 3, 10, 2, 6, 10, 3, 6, 10, 2, 6, 10, 2, 6, 10, 2, 6…
## $ date          <date> 2007-10-14, 2008-03-30, 2008-10-12, 2009-02-26, 2009-0…
## $ nox           <dbl> 0.8300237, 0.0100000, 0.2819196, 1.2675291, 0.7934017, …

Exploratory data analysis

wq %>%
  ggplot(aes(x = date, y = nox, col = subregion)) +
  geom_point() +
  facet_grid(season ~ region)

wq %>%
  ggplot(aes(x = date, y = nox, col = subregion)) +
  geom_point() +
  facet_wrap( ~ reef_alias, scales = "free_y") +
  geom_hline(yintercept=0) +
  geom_smooth(col = "red") +
  scale_y_continuous(trans = scales::pseudo_log_trans())

Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\beta_0 + f(Date_i) + f(Month_i) \]

where \(\beta_0\) is the y-intercept. \(f(Date)\) and \(f(Month)\) indicate the additive smoothing functions of the long-term temporal trends and the annual seasonal trends respectively.

Data preparations

Use package lubridate to get dates into decimals as a numerical variable:

wq <- wq %>% mutate(dt = decimal_date(date))

We’ll start with a simple model only.

Simple model (Green only)

wq1 <- wq %>% filter(reef_alias == "Green") %>%
  filter(complete.cases(nox))

wq1 %>% ggplot(aes(x = date, y = nox)) +
  geom_smooth() +
  scale_y_log10() +
  scale_x_date(expand = expansion()) +
  geom_point()

Maybe not so much this example, but other examples show that Gaussian may not work super well. However, we can try other distributions, such as a gamma (but replace all 0 detections with an arbitrarily small number), lognormal, or a tweedie.

wq_gam1 <- gam(nox ~ s(dt), data = wq1, family = gaussian, method = "REML")
# pseudo log-normal
wq_gam2 <- gam(nox ~ s(dt), data = wq1, family = gaussian(link = "log"),
               method = "REML")
wq_gam3 <- gam(nox ~ s(dt), data = wq1, family = Gamma(link = "log"),
               method = "REML")
wq_gam4 <- gam(nox ~ s(dt), data = wq1, family = tw(link = "log"),
               method = "REML")
AICc(wq_gam1, wq_gam2, wq_gam3, wq_gam4) %>% arrange(AICc)

Gamma is best (wq_gam3), Tweedie is fairly similar, but on the basis of complexity, gamma is better.

Model validation

k.check(wq_gam3) # not over-constrained
##       k'     edf  k-index p-value
## s(dt)  9 3.07189 1.463834       1
appraise(wq_gam3) # looks ok overall.

concurvity(wq_gam3) # concurvity ok! Close to zero.
##                  para        s(dt)
## worst    3.097166e-23 3.096820e-23
## observed 3.097166e-23 1.023266e-30
## estimate 3.097166e-23 1.116082e-25

The fact that the observed vs. predicted is not so tight isn’t necessarily surprising, considering we only have one predictor at present!

simulateResiduals(wq_gam3, plot=T)

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0 0.992 0.076 0.752 0.632 0.56 0.224 0.288 0.288 0.28 0.88 0.184 0.744 0.512 0.624 0.784 0.412 0.552 0.72 0.856 ...

There’s a slight worry about non-linearity in the residuals, not that the bottom line is red, but rather, that they all trend towards going upwards. Not a show stopper, but definitely would like to improve on this prior to selecting a final model.

draw(wq_gam3, residuals = F)

Exploring more models

What if there were different temporal trends for wet vs. dry seasons? This how you would model this…

wq_gam5 <- gam(nox ~ s(dt, by = season), data = wq1, family = Gamma(link = "log"),
               method = "REML")
simulateResiduals(wq_gam5, plot = T)

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0 0.964 0.204 0.896 0.512 0.632 0.308 0.384 0.26 0.268 0.988 0.112 0.712 0.748 0.56 0.676 0.684 0.492 0.644 0.988 ...
k.check(wq_gam5)
##                 k'      edf  k-index p-value
## s(dt):seasonDry  9 2.883745 1.230821   0.910
## s(dt):seasonWet  9 1.191068 1.230821   0.905
draw(wq_gam5)

What about monthly cycling, or annual cycling? We can look at this using a more fine distinction of month and day of the year effects, as a cyclical effect.

Need to tell it how many months the basis functions it is.

wq_gam6 <- gam(nox ~ s(dt) +
                 s(month, bs = 'cc', k = 5, fx = FALSE), 
               knots = list(month = seq(1,12, length = 5)),
               data = wq1, family = Gamma(link = "log"),
               method = "REML")

Mixed model (all reefs)

Fit a random effects model per reef:

wq_gam7 <- gam(nox ~ s(dt, by = region) +
                 s(month, bs = 'cc', by = region, k = 5) +
                 s(reef_alias, bs = "re"), 
               knots = list(month = seq(1,12, length = 5)),
               family = Gamma(link = "log"),
               data = wq, method = "REML")
k.check(wq_gam7)
##                                  k'          edf   k-index p-value
## s(dt):regionBurdekin              9 6.298137e+00 0.7493435  0.0000
## s(dt):regionMackay Whitsunday     9 3.970266e+00 0.7493435  0.0000
## s(dt):regionWet Tropics           9 7.138526e+00 0.7493435  0.0000
## s(month):regionBurdekin           3 1.517644e+00 0.8689999  0.1200
## s(month):regionMackay Whitsunday  3 2.323597e+00 0.8689999  0.1300
## s(month):regionWet Tropics        3 7.884133e-04 0.8689999  0.1225
## s(reef_alias)                    28 2.241402e+01        NA      NA

Possibly more complexity in the date effect.

wq_gam8 <- gam(nox ~ s(dt, by = region, k = 20) +
                 s(month, bs = 'cc', by = region, k = 5) +
                 s(reef_alias, bs = "re"), 
               knots = list(month = seq(1,12, length = 5)),
               family = Gamma(link = "log"),
               data = wq, method = "REML")
k.check(wq_gam8)
##                                  k'          edf   k-index p-value
## s(dt):regionBurdekin             19 6.887115e+00 0.7588022  0.0000
## s(dt):regionMackay Whitsunday    19 4.235643e+00 0.7588022  0.0000
## s(dt):regionWet Tropics          19 8.888716e+00 0.7588022  0.0000
## s(month):regionBurdekin           3 1.541980e+00 0.8727190  0.1300
## s(month):regionMackay Whitsunday  3 2.322221e+00 0.8727190  0.1150
## s(month):regionWet Tropics        3 6.593186e-04 0.8727190  0.1225
## s(reef_alias)                    28 2.241898e+01        NA      NA

Model investigation / hypothesis testing

summary(wq_gam8)
## 
## Family: Gamma 
## Link function: log 
## 
## Formula:
## nox ~ s(dt, by = region, k = 20) + s(month, bs = "cc", by = region, 
##     k = 5) + s(reef_alias, bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1199     0.1464   0.819    0.413
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(dt):regionBurdekin             6.887e+00  8.496 11.925  < 2e-16 ***
## s(dt):regionMackay Whitsunday    4.236e+00  5.263  3.620  0.00272 ** 
## s(dt):regionWet Tropics          8.889e+00 10.930 18.653  < 2e-16 ***
## s(month):regionBurdekin          1.542e+00  3.000  1.099  0.11036    
## s(month):regionMackay Whitsunday 2.322e+00  3.000  9.877 6.07e-07 ***
## s(month):regionWet Tropics       6.593e-04  3.000  0.000  0.70889    
## s(reef_alias)                    2.242e+01 27.000  5.265  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.243   Deviance explained = 47.6%
## -REML = 684.18  Scale est. = 1.1562    n = 568

Everything by date is wiggly, the Burdekin and Wet Tropics are not wiggly, but the Mackay/Whitsundays are wiggly!

draw(wq_gam8)

Further analyses

emmeans(wq_gam8, pairwise ~ dt, at = list(dt = c(2014,2016)), type = "response")
## $emmeans
##    dt response    SE  df lower.CL upper.CL
##  2014    2.355 0.406 521    1.678     3.31
##  2016    0.933 0.115 521    0.732     1.19
## 
## Results are averaged over the levels of: reef_alias, region 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast    ratio    SE  df null t.ratio p.value
##  2014 / 2016  2.52 0.475 521    1   4.921  <.0001
## 
## Results are averaged over the levels of: reef_alias, region 
## Tests are performed on the log scale

NOx is 2.52 times higher in 2014 vs. 2016!

emmeans(wq_gam8, pairwise ~ dt|region, at = list(dt = c(2014,2016)), type = "response") %>% confint()
## $emmeans
## region = Burdekin:
##    dt response    SE  df lower.CL upper.CL
##  2014    1.923 0.634 521    1.006     3.67
##  2016    0.840 0.195 521    0.533     1.32
## 
## region = Mackay Whitsunday:
##    dt response    SE  df lower.CL upper.CL
##  2014    2.421 0.825 521    1.239     4.73
##  2016    1.080 0.276 521    0.653     1.79
## 
## region = Wet Tropics:
##    dt response    SE  df lower.CL upper.CL
##  2014    2.805 0.538 521    1.925     4.09
##  2016    0.895 0.118 521    0.690     1.16
## 
## Results are averaged over the levels of: reef_alias 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
## region = Burdekin:
##  contrast    ratio    SE  df lower.CL upper.CL
##  2014 / 2016  2.29 0.863 521     1.09     4.80
## 
## region = Mackay Whitsunday:
##  contrast    ratio    SE  df lower.CL upper.CL
##  2014 / 2016  2.24 0.763 521     1.15     4.37
## 
## region = Wet Tropics:
##  contrast    ratio    SE  df lower.CL upper.CL
##  2014 / 2016  3.13 0.755 521     1.95     5.03
## 
## Results are averaged over the levels of: reef_alias 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale

Burdekin: 2.29x Mackay-Whitsunday: 2.24x Wet Tropics: 3.13x

fx means that it does not force the model to have k=5, but just sets k to a max of 5. Defaults to fx = FALSE.

Summary figures

References